
network is formulated as

\begin{equation}
\begin{aligned}
\text{Forward:}\quad & \text{Q-Linear}(x) = \hat{x} \cdot \hat{w}
  = \alpha_x \alpha_w \big( (Q_a(x) + z/\alpha_x) \odot Q_w(w) \big), \\[4pt]
\text{Backward:}\quad & \frac{\partial J}{\partial x}
  = \frac{\partial J}{\partial \hat{x}} \frac{\partial \hat{x}}{\partial x}
  = \begin{cases}
      \dfrac{\partial J}{\partial \hat{x}} & \text{if } x \in [Q^x_n, Q^x_p], \\
      0 & \text{otherwise},
    \end{cases} \\[4pt]
& \frac{\partial J}{\partial w}
  = \frac{\partial J}{\partial \hat{x}} \frac{\partial \hat{x}}{\partial \hat{w}} \frac{\partial \hat{w}}{\partial w}
  = \begin{cases}
      \dfrac{\partial J}{\partial \hat{x}} \dfrac{\partial \hat{x}}{\partial \hat{w}} & \text{if } w \in [Q^w_n, Q^w_p], \\
      0 & \text{otherwise},
    \end{cases}
\end{aligned}
\tag{2.14}
\end{equation}

where $J$ is the loss function, $Q(\cdot)$ denotes the quantizer applied in forward propagation, and $\odot$ denotes the matrix multiplication with efficient bit-wise operations. In backward propagation, the straight-through estimator (STE) [9] is used to retain the gradient derivation.
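
To make Eq. (2.14) concrete, the following PyTorch sketch implements a uniform quantizer with an STE backward pass and a quantized linear layer. It is a simplified illustration rather than the Q-ViT implementation: it assumes a symmetric quantizer (omitting the zero point $z$), does not model the gradient of the learnable scales, and the names QuantFn and QLinear are chosen here only for illustration.

\begin{verbatim}
import torch
import torch.nn as nn
import torch.nn.functional as F


class QuantFn(torch.autograd.Function):
    """Uniform quantizer Q(v) = clip(round(v / alpha), Qn, Qp) with an STE
    backward: gradients pass through where v/alpha lies in [Qn, Qp] and are
    zeroed elsewhere, as in the backward pass of Eq. (2.14)."""

    @staticmethod
    def forward(ctx, v, alpha, qn, qp):
        ctx.save_for_backward(v, alpha)
        ctx.qn, ctx.qp = qn, qp
        return torch.clamp(torch.round(v / alpha), qn, qp)

    @staticmethod
    def backward(ctx, grad_out):
        v, alpha = ctx.saved_tensors
        inside = ((v / alpha) >= ctx.qn) & ((v / alpha) <= ctx.qp)
        # STE: only the clipping mask modifies the incoming gradient.
        # (The gradient of the scale alpha, as in LSQ, is omitted here.)
        return grad_out * inside.to(grad_out.dtype), None, None, None


class QLinear(nn.Linear):
    """Linear layer whose input and weight are quantized before the matrix
    multiplication and rescaled by alpha_x * alpha_w (forward pass of
    Eq. (2.14), without the zero point z)."""

    def __init__(self, in_features, out_features, num_bits=2):
        super().__init__(in_features, out_features, bias=True)
        self.qp = 2 ** (num_bits - 1) - 1
        self.qn = -(2 ** (num_bits - 1))
        self.alpha_x = nn.Parameter(torch.tensor(1.0))  # activation scale
        self.alpha_w = nn.Parameter(torch.tensor(1.0))  # weight scale

    def forward(self, x):
        qx = QuantFn.apply(x, self.alpha_x, self.qn, self.qp)
        qw = QuantFn.apply(self.weight, self.alpha_w, self.qn, self.qp)
        out = self.alpha_x * self.alpha_w * F.linear(qx, qw)
        if self.bias is not None:
            out = out + self.bias  # bias kept in full precision
        return out
\end{verbatim}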

The input image is first encoded as patches and passed through several transformer blocks. Each transformer block consists of two components: Multi-Head Self-Attention (MHSA) and a Multi-Layer Perceptron (MLP). The computation of the attention weight depends on the corresponding query $q$, key $k$, and value $v$, and the quantized computation in one attention head is

\begin{equation}
q = \text{Q-Linear}_q(x), \qquad k = \text{Q-Linear}_k(x), \qquad v = \text{Q-Linear}_v(x),
\tag{2.15}
\end{equation}

where $\text{Q-Linear}_q$, $\text{Q-Linear}_k$, and $\text{Q-Linear}_v$ denote the three quantized linear layers for $q$, $k$, and $v$, respectively. Thus, the attention weight is formulated as

\begin{equation}
A = \frac{1}{\sqrt{d}} \, Q_a(q) \odot Q_a(k)^{\top}, \qquad
Q_A = Q_a(\mathrm{softmax}(A)).
\tag{2.16}
\end{equation}
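
Building on the sketch above, a quantized attention head following Eqs. (2.15) and (2.16) could be organized as follows. The class QAttentionHead reuses QuantFn and QLinear from the previous sketch and, purely for brevity, shares a single signed activation scale for $q$, $k$, and the softmax output; these simplifications are illustrative assumptions, not part of the original method.

\begin{verbatim}
import math

import torch
import torch.nn as nn


class QAttentionHead(nn.Module):
    """One quantized attention head following Eqs. (2.15)-(2.16)."""

    def __init__(self, embed_dim, head_dim, num_bits=2):
        super().__init__()
        # Three quantized linear layers producing q, k, v (Eq. 2.15).
        self.q_proj = QLinear(embed_dim, head_dim, num_bits)
        self.k_proj = QLinear(embed_dim, head_dim, num_bits)
        self.v_proj = QLinear(embed_dim, head_dim, num_bits)
        # A single shared activation scale, for brevity only.
        self.alpha_a = nn.Parameter(torch.tensor(1.0))
        self.qp = 2 ** (num_bits - 1) - 1
        self.qn = -(2 ** (num_bits - 1))
        self.head_dim = head_dim

    def quant(self, v):
        return QuantFn.apply(v, self.alpha_a, self.qn, self.qp)

    def forward(self, x):
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        # Attention weight from quantized q and k, scaled by 1/sqrt(d).
        attn = self.quant(q) @ self.quant(k).transpose(-2, -1)
        attn = attn / math.sqrt(self.head_dim)
        # Quantize the softmax output before multiplying with v; an unsigned
        # quantizer would better fit the [0, 1] range of the softmax.
        q_a = self.quant(torch.softmax(attn, dim=-1))
        return q_a @ v
\end{verbatim}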

Training for Quantized ViT. Knowledge distillation is an essential supervision approach for training QNNs, as it bridges the performance gap between quantized models and their full-precision counterparts. The usual practice is to use distillation with attention, as described in [224]:

\begin{equation}
L_{\mathrm{dist}} = \frac{1}{2} L_{\mathrm{CE}}(\psi(Z_q), y)
  + \frac{1}{2} L_{\mathrm{CE}}(\psi(Z_q), y_t), \qquad
y_t = \arg\max_{c} Z_t(c),
\tag{2.17}
\end{equation}

where $\psi$ denotes the softmax function, $Z_q$ and $Z_t$ are the output logits of the quantized student and the full-precision teacher, $y$ is the ground-truth label, and $y_t$ is the hard label predicted by the teacher.
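
The hard-label distillation objective in Eq. (2.17) can be computed in a few lines. The sketch below assumes the student logits $Z_q$ come from the quantized ViT and the teacher logits $Z_t$ from a full-precision teacher; the function name and tensor shapes are illustrative.

\begin{verbatim}
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, labels):
    """L_dist = 1/2 CE(psi(Z_q), y) + 1/2 CE(psi(Z_q), y_t), where the hard
    teacher label is y_t = argmax_c Z_t(c); cross_entropy applies the
    softmax psi internally."""
    y_t = teacher_logits.argmax(dim=-1)                # hard teacher label
    loss_gt = F.cross_entropy(student_logits, labels)  # supervision from y
    loss_kd = F.cross_entropy(student_logits, y_t)     # supervision from y_t
    return 0.5 * loss_gt + 0.5 * loss_kd


# Example usage with random tensors standing in for real logits and labels:
z_q = torch.randn(8, 1000)          # logits of the quantized student
z_t = torch.randn(8, 1000)          # logits of the full-precision teacher
y = torch.randint(0, 1000, (8,))    # ground-truth labels
print(distillation_loss(z_q, z_t, y).item())
\end{verbatim}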

2.3.2 Performance Degeneration of Fully Quantized ViT Baseline

Intuitively, in the fully quantized ViT baseline, the information representation ability depends mainly on the transformer-based architecture, such as the attention weights in the MHSA module. However, the performance gain brought by this architecture is severely limited by the quantized parameters, and the rounded, discrete quantization also significantly hampers optimization. This phenomenon indicates that the bottleneck of the fully quantized ViT baseline lies in both the architecture (forward propagation) and the optimization (backward propagation).

Architecture bottleneck. We replace each module with its full-precision counterpart in turn and compare the resulting accuracy drops, as shown in Fig. 2.5. We find that quantizing the query, key, value, and attention weight, i.e., $\mathrm{softmax}(A)$ in Eq. (2.16), to 2 bits brings the most significant accuracy drop among all parts of ViT, up to 10.03%. Although the quantized MLP layers and the quantized weights of the linear layers in MHSA result in